Predicting numerical outcomes using linear regression and random forest
This is an exercise for me to revise the tidymodels way of carrying out regression using linear regression (OLS) and random forest models. Only after going through the readings, my course materials, and actually working through a dataset myself did I really appreciate what my Prof was trying to teach during his machine learning course.
I will be working on a dataset familiar to me - the white wine dataset.
Machine learning can be divided into supervised and unsupervised learning - the differentiating point is whether the outcome is known. In this case, I want to apply what I have learnt to predicting a known numerical outcome - this is regression. If I wanted to predict a known categorical outcome - e.g. whether the wine is red or white - that would be classification. If I were unsure what the types of wine are, and just wanted to group them, that would be clustering.
For regression, I could work on predicting the quality of wine from the white wine dataset, the quality of wine from the red wine dataset, or the quality of wine from both the red and white wine dataset.
As a start, let me try to predict the quality score of white wine from the various attributes.
Most of the points mentioned below were taken from https://jhudatascience.org/tidyversecourse/model.html#summary-of-tidymodels, which was a great read for framing my learning and seeing the whole picture.
The general steps are outlined below:
Features/X variables: 1-11 - the physicochemical attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol)
Outcome/Y variable: 12 - quality (score between 0 and 10)
What do you look out for when doing EDA?
Shape of data: How many Y and how many X variables are there? How many observations are there? The number of observations (n) and number of parameters (p) may affect your choice of models.
Type of variables: are they numerical or categorical? Are there numerical variables that are better treated as categorical (e.g. month of the year), or categorical variables that should be transformed into dummy variables (e.g. for regression)?
Are there any missing values? This may affect the modelling as some models cannot handle missing values.
Within each variable, what are the min, max, mean, median and range? Are the datapoints normally distributed, or skewed? This affects whether modelling (e.g. OLS regression) can be carried out directly, or whether further transformations should be carried out first.
How are the X variables related to each other? Are they correlated? What is the strength of correlation? Is there a multi-collinearity issue? These are points to be addressed for linear regression. (A compact code sketch of these checks follows this list.)
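As a minimal sketch of the checks above, assuming the data has already been read into a data frame called df (a hypothetical name used only for illustration; base R is enough for a first pass):

# shape of the data: number of observations (n) vs number of variables (p)
dim(df)

# variable types and missing values
str(df)
anyNA(df)

# per-variable summaries: min, max, mean, median, quartiles
summary(df)

# pairwise correlations between numeric variables, to flag possible multi-collinearity
cor(df[sapply(df, is.numeric)])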
After identifying all the “touch-up” points required for successful modelling, the data may be pre-processed using the recipes package. This is really a powerful package that can transform the data the way you want, using single-line commands (as compared to multiple clicks of the mouse). However, it requires the user to know what steps are to be taken.
A list of preprocessing steps is given below:
https://recipes.tidymodels.org/reference/index.html#section-basic-functions
Like cooking, this is part art, part science. If I want to do missing-value imputation, which kind of imputation should I use? I am still learning as I go along for this step.
In a way, this preprocessing step helps you to zoom in razor-sharp on the important X variables that can be used to predict Y. These X variables may exist as hidden variables that need carving out and polishing/transformation. Other X variables may not carry much information, so the aim is to extract the relevant information and keep the number of variables as small as possible without compromising accuracy. In other words, this is feature engineering.
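A few recipes steps that carve down the predictor set in this spirit (a sketch only, assuming a data frame df with outcome quality; of these, only step_corr() is actually used later in this post):

library(recipes)

recipe(quality ~ ., data = df) %>%
  step_nzv(all_predictors()) %>%                    # drop near-zero-variance predictors
  step_corr(all_predictors(), threshold = 0.9) %>%  # drop one of each highly correlated pair
  step_normalize(all_predictors()) %>%              # centre and scale before PCA
  step_pca(all_predictors(), num_comp = 5)          # compress what is left into a few components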
Split the dataset into training, testing and cross-validation sets. The training dataset is for training models. The cross-validation folds are drawn from the training dataset and are used for tuning the parameters of the model to make it an even better model. Cross-validation is useful when the number of observations isn't big enough to split into three separate datasets for training, validation and testing. Instead, the training data is randomly partitioned into subsamples (folds); each fold takes a turn as the validation set while the remaining folds are used for fitting. In repeated cross-validation, the whole procedure is repeated several times, each time with a different random partitioning of the data. The test dataset is held back for testing the trained model, to see whether it can deliver predictive results on data it has never seen.
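A minimal sketch of this with rsample, again assuming a hypothetical data frame df (the analysis below uses a single round of 10-fold CV; the repeats argument is what turns it into repeated cross-validation):

library(rsample)

set.seed(2021)
df_split <- initial_split(df, prop = 0.8)   # 80% training, 20% testing
df_train <- training(df_split)
df_test  <- testing(df_split)

# 10-fold cross-validation of the training set, repeated 5 times
df_folds <- vfold_cv(df_train, v = 10, repeats = 5)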
Usually, you will train more than one model and then compare the results. The choice of model can span simple, easy-to-interpret linear models to hard-to-interpret but accurate models (e.g. neural networks, which behave like black boxes). Results on the training dataset will usually not fare too badly, since the model was built from that dataset. Different models may require different preprocessing and different types of parameter tuning.
A litmus test of whether the model works is how well it performs its predictive task when a set of completely new data is provided to it. Performance may be judged in terms of R-squared, root mean square error (RMSE) or mean absolute error (MAE).
The indicators above give us an understanding of how accurately the model predicts new data. A model that is over-fitted fits the training dataset well but is unable to predict the test dataset well (high variance). A model that is too simple may not capture enough of the structure in the data to give good predictions on either dataset (high bias). Balancing these two failure modes is known as the bias-variance tradeoff.
Results should be shown as visualizations/data tables to communicate the findings.
Load required packages:
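A likely set of packages, based on the functions called below:

library(tidyverse)   # dplyr, ggplot2, purrr, tibble, etc.
library(tidymodels)  # rsample, recipes, parsnip, workflows, tune, yardstick, broom
library(skimr)       # skim()
library(GGally)      # ggpairs(), ggcorr()
library(vip)         # variable importance plots
library(ggthemes)    # theme_few()
library(gridExtra)   # grid.arrange(), also called via gridExtra:: below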
white_rawdata <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv",
strip.white = T, sep = ";", header = T) %>%
as_tibble()
# save as another variable
data_white <- white_rawdata
glimpse(data_white) # 12 variables, all numerical
Rows: 4,898
Columns: 12
$ fixed.acidity <dbl> 7.0, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7.0,…
$ volatile.acidity <dbl> 0.27, 0.30, 0.28, 0.23, 0.23, 0.28, 0.3…
$ citric.acid <dbl> 0.36, 0.34, 0.40, 0.32, 0.32, 0.40, 0.1…
$ residual.sugar <dbl> 20.70, 1.60, 6.90, 8.50, 8.50, 6.90, 7.…
$ chlorides <dbl> 0.045, 0.049, 0.050, 0.058, 0.058, 0.05…
$ free.sulfur.dioxide <dbl> 45, 14, 30, 47, 47, 30, 30, 45, 14, 28,…
$ total.sulfur.dioxide <dbl> 170, 132, 97, 186, 186, 97, 136, 170, 1…
$ density <dbl> 1.0010, 0.9940, 0.9951, 0.9956, 0.9956,…
$ pH <dbl> 3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.1…
$ sulphates <dbl> 0.45, 0.49, 0.44, 0.40, 0.40, 0.44, 0.4…
$ alcohol <dbl> 8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.…
$ quality <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, …
summary(data_white) # scale and range of values are quite different
fixed.acidity volatile.acidity citric.acid residual.sugar
Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
chlorides free.sulfur.dioxide total.sulfur.dioxide
Min. :0.00900 Min. : 2.00 Min. : 9.0
1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
Median :0.04300 Median : 34.00 Median :134.0
Mean :0.04577 Mean : 35.31 Mean :138.4
3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
Max. :0.34600 Max. :289.00 Max. :440.0
density pH sulphates alcohol
Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
quality
Min. :3.000
1st Qu.:5.000
Median :6.000
Mean :5.878
3rd Qu.:6.000
Max. :9.000
skim(data_white) # no missing values, probably need to normalize data
Name: data_white
Number of rows: 4898
Number of columns: 12
Column type frequency: numeric (12)
Group variables: None
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
fixed.acidity | 0 | 1 | 6.85 | 0.84 | 3.80 | 6.30 | 6.80 | 7.30 | 14.20 | ▁▇▁▁▁ |
volatile.acidity | 0 | 1 | 0.28 | 0.10 | 0.08 | 0.21 | 0.26 | 0.32 | 1.10 | ▇▅▁▁▁ |
citric.acid | 0 | 1 | 0.33 | 0.12 | 0.00 | 0.27 | 0.32 | 0.39 | 1.66 | ▇▆▁▁▁ |
residual.sugar | 0 | 1 | 6.39 | 5.07 | 0.60 | 1.70 | 5.20 | 9.90 | 65.80 | ▇▁▁▁▁ |
chlorides | 0 | 1 | 0.05 | 0.02 | 0.01 | 0.04 | 0.04 | 0.05 | 0.35 | ▇▁▁▁▁ |
free.sulfur.dioxide | 0 | 1 | 35.31 | 17.01 | 2.00 | 23.00 | 34.00 | 46.00 | 289.00 | ▇▁▁▁▁ |
total.sulfur.dioxide | 0 | 1 | 138.36 | 42.50 | 9.00 | 108.00 | 134.00 | 167.00 | 440.00 | ▂▇▂▁▁ |
density | 0 | 1 | 0.99 | 0.00 | 0.99 | 0.99 | 0.99 | 1.00 | 1.04 | ▇▂▁▁▁ |
pH | 0 | 1 | 3.19 | 0.15 | 2.72 | 3.09 | 3.18 | 3.28 | 3.82 | ▁▇▇▂▁ |
sulphates | 0 | 1 | 0.49 | 0.11 | 0.22 | 0.41 | 0.47 | 0.55 | 1.08 | ▃▇▂▁▁ |
alcohol | 0 | 1 | 10.51 | 1.23 | 8.00 | 9.50 | 10.40 | 11.40 | 14.20 | ▃▇▆▃▁ |
quality | 0 | 1 | 5.88 | 0.89 | 3.00 | 5.00 | 6.00 | 6.00 | 9.00 | ▁▅▇▃▁ |
data_white %>%
ggpairs() # distribution of X is quite skewed, except maybe for pH.
# outcome is not unimodal --> ?
data_white %>%
ggcorr(label = T, label_alpha = T, label_round = 2) # some collinearity issues?
It seems that an OLS linear model isn't really the best option for this case, since Y is multi-modal. This problem may be more suitable for classification to predict Y instead.
Nevertheless, let me compare the performance of OLS linear model with random forest, just for me to familiarise myself with the workflow for regression.
set.seed(202102212)
whitewine_split <- initial_split(data_white, prop = 0.8)
whitewine_train <- training(whitewine_split)
whitewine_test <- testing(whitewine_split)
# split training dataset for cross-validation
set.seed(20210228)
white_cv <- vfold_cv(whitewine_train) # split training dataset for tuning mtry later
Splitting the dataset into training and testing sets helps to detect over-fitting of the model. An over-fitted model fits the existing data very well, but is unable to predict new data accurately.
The aim of preprocessing would be to solve the multi-collinearity issue, transform the data so that the distribution is not skewed, as well as to normalize the data.
whitewine_reciped <- whitewine_train %>%
recipe(quality ~., .) %>%
step_log(all_predictors(), -pH, offset = 1) %>% # do not use all_numeric(), since that would also transform Y (the outcome)
step_corr(all_predictors(), threshold = 0.5) %>% # drop one of each pair of predictors with absolute correlation above 0.5
step_normalize(all_predictors()) # mean centering and scaling
whitewine_reciped
Data Recipe
Inputs:
role #variables
outcome 1
predictor 11
Operations:
Log transformation on all_predictors(), -pH
Correlation filter on all_predictors()
Centering and scaling for all_predictors()
whitewine_preprocessed <- prep(whitewine_reciped, verbose = T)
oper 1 step log [training]
oper 2 step corr [training]
oper 3 step normalize [training]
The retained training set is ~ 0.29 Mb in memory.
ww_transformed <- whitewine_preprocessed %>% bake(new_data = NULL) # see preprocessed data
ww_transformed %>% as_tibble() %>% round(., digits = 3) %>% head()
fixed.acidity  volatile.acidity  citric.acid  residual.sugar  chlorides  free.sulfur.dioxide
        0.226            -0.042        0.264            1.83     -0.023                0.682
       -0.643             0.27         0.093            -1.1      0.171               -1.47
        0.46             -0.471       -0.08            0.691      0.604                0.763
        0.46             -0.471       -0.08            0.691      0.604                0.763
        1.45              0.063        0.598           0.436      0.219               -0.075
        0.226            -0.042        0.264            1.83     -0.023                0.682
(6 of 10 columns shown; all columns: fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, pH, sulphates, alcohol, quality)
# check for missing values
ww_transformed %>%
map(is.na) %>%
map_df(sum) %>%
tidy() %>%
select(column, mean) %>% # no missing values
as_tibble()
┌──────────────────────────────┐
│ column mean │
├──────────────────────────────┤
│ fixed.acidity 0 │
│ volatile.acidity 0 │
│ citric.acid 0 │
│ residual.sugar 0 │
│ chlorides 0 │
│ free.sulfur.dioxide 0 │
│ pH 0 │
│ sulphates 0 │
│ alcohol 0 │
│ quality 0 │
└──────────────────────────────┘
Column names: column, mean
ww_lr_model <- linear_reg() %>%
set_engine("lm") %>% # there are other options available, eg glmnet
set_mode("regression") # could also be classification for certain models, so just specify as best practice to be clear
ww_lr_workflow <- workflow() %>%
add_recipe(whitewine_reciped) %>%
add_model(ww_lr_model)
# fit model to training data, and get predicted values
final_ww_lm_model_fit <- fit(ww_lr_workflow, whitewine_train)
# understanding the lm model
lm_fit_output <- final_ww_lm_model_fit %>%
pull_workflow_fit() %>%
tidy() %>%
as_tibble()
lm_fit_output
term                  estimate    std.error  statistic  p.value
(Intercept)             5.88       0.012      488        0
fixed.acidity          -0.0459     0.0139      -3.3      0.000983
volatile.acidity       -0.199      0.0126     -15.8      2.61e-54
citric.acid            -0.000888   0.0129      -0.0686   0.945
residual.sugar          0.121      0.0142       8.53     1.98e-17
chlorides              -0.0194     0.0133      -1.45     0.147
free.sulfur.dioxide     0.125      0.0131       9.61     1.31e-21
pH                      0.0165     0.0139       1.19     0.234
sulphates               0.0465     0.0123       3.78     0.000161
alcohol                 0.458      0.0147      31.1      4.63e-190
lm_fit <- final_ww_lm_model_fit %>%
pull_workflow_fit()
lm_fit
parsnip model object
Fit time: 8ms
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) fixed.acidity volatile.acidity
5.8775198 -0.0459302 -0.1989870
citric.acid residual.sugar chlorides
-0.0008882 0.1212696 -0.0193643
free.sulfur.dioxide pH sulphates
0.1254698 0.0165026 0.0465193
alcohol
0.4576764
# Looking at the fitted values:
lm_fitted_values <- lm_fit$fit$fitted.values
# another way, from workflow
lm_wf_fitted_values <-
broom::augment(lm_fit$fit, data = whitewine_train) %>%
select(quality, .fitted: .std.resid)
glimpse(lm_wf_fitted_values)
Rows: 3,919
Columns: 6
$ quality <int> 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 7, 5, 6, 8, 5, 8…
$ .fitted <dbl> 5.469029, 5.165576, 5.863142, 5.863142, 5.686045,…
$ .hat <dbl> 0.0016112016, 0.0019222325, 0.0009142831, 0.00091…
$ .sigma <dbl> 0.7532844, 0.7532139, 0.7533292, 0.7533292, 0.753…
$ .cooksd <dbl> 8.032116e-05, 2.368036e-04, 3.023818e-06, 3.02381…
$ .std.resid <dbl> 0.705488442, 1.108851543, 0.181776968, 0.18177696…
# looking at variable importance
vip_lm <- final_ww_lm_model_fit %>%
pull_workflow_fit() %>% # extracts the model information
vip(num_features = 10,
aesthetics = list(fill = "deepskyblue4")) + # most important factor is alcohol
labs(title = "Variable Importance: Linear Regression") +
theme_few() +
theme(axis.text = element_text(face = "bold", size = 14))
vip_lm
rf_model <- rand_forest() %>%
set_args(mtry = tune()) %>%
set_mode(mode = "regression") %>%
set_engine(engine = "ranger", importance = "impurity")
rf_model
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
rf_workflow <- workflow() %>%
add_recipe(whitewine_reciped) %>%
add_model(rf_model)
rf_grid <- expand.grid(mtry = 3:7) # a small grid around sqrt(no. of predictors), a common starting heuristic
tuned_rf_results <- rf_workflow %>%
tune_grid(resamples = white_cv, # using cv dataset from training dataset
grid = rf_grid,
metrics = metric_set(rmse, rsq, mae))
model_results <- tuned_rf_results %>%
collect_metrics()
finalized_rf_param <- tuned_rf_results %>%
select_best(metric = "rmse") %>%
as_tibble()
finalized_rf_param # mtry = 3
┌───────────────────────────────┐
│ mtry .config │
├───────────────────────────────┤
│ 3 Preprocessor1_Model1 │
└───────────────────────────────┘
Column names: mtry, .config
rf_model_b <- rand_forest() %>%
set_args(mtry = 3) %>%
set_engine(engine = "ranger", importance = "impurity") %>%
set_mode(mode = "regression")
rf_workflow_b <- workflow() %>%
add_recipe(whitewine_reciped) %>%
add_model(rf_model_b)
final_ww_rf_model_fit <- fit(rf_workflow_b, whitewine_train)
# understanding the rf model
# for random forest, need to set importance = impurity in set_engine() to extract this
vip_rf <- final_ww_rf_model_fit %>%
pull_workflow_fit() %>% # extracts the model information
vip(num_features = 10,
aesthetics = list(fill = "darkorange"))+ # most important factor is alcohol
labs(title = "Variable Importance: Random Forest") +
theme_few() +
theme(axis.text = element_text(face = "bold", size = 14))
vip_rf
gridExtra::grid.arrange(vip_lm, vip_rf, nrow = 2)
Alcohol content was the most important variable for both OLS and random forest models.
results_train <- final_ww_lm_model_fit %>%
predict(new_data = whitewine_train) %>% # use actual train data, not preprocessed data
mutate(truth = whitewine_train$quality,
model = "lm") %>%
bind_rows(final_ww_rf_model_fit %>%
predict(new_data = whitewine_train) %>%
mutate(truth = whitewine_train$quality,
model = "rf"))
results_train %>%
group_by(model) %>%
metrics(truth = truth, estimate = .pred) %>%
as_tibble()
┌──────────────────────────────────────────┐
│ model .metric .estimator .estimate │
├──────────────────────────────────────────┤
│ lm rmse standard 0.752 │
│ rf rmse standard 0.277 │
│ lm rsq standard 0.282 │
│ rf rsq standard 0.934 │
│ lm mae standard 0.584 │
│ rf mae standard 0.198 │
└──────────────────────────────────────────┘
Column names: model, .metric, .estimator, .estimate
results_test <- final_ww_lm_model_fit %>%
predict(new_data = whitewine_test) %>%
mutate(truth = whitewine_test$quality,
model = "lm") %>%
bind_rows(final_ww_rf_model_fit %>%
predict(new_data = whitewine_test) %>%
mutate(truth = whitewine_test$quality,
model = "rf"))
results_test %>%
group_by(model) %>%
metrics(truth = truth, estimate = .pred) %>%
as_tibble()
┌──────────────────────────────────────────┐
│ model .metric .estimator .estimate │
├──────────────────────────────────────────┤
│ lm rmse standard 0.737 │
│ rf rmse standard 0.589 │
│ lm rsq standard 0.295 │
│ rf rsq standard 0.56 │
│ lm mae standard 0.581 │
│ rf mae standard 0.435 │
└──────────────────────────────────────────┘
Column names: model, .metric, .estimator, .estimate
Comparing RMSE, the random forest has a much lower RMSE on the training dataset, but the value increases noticeably on the test dataset, which points to over-fitting: it cannot predict new data as well as it fits the training data. The same pattern holds for the other indicators, R-squared and mean absolute error.
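One possible follow-up (not done in this post) would be to tune more of the forest's hyperparameters, since min_n in particular limits how far each tree can grow and can help rein in over-fitting. A rough sketch, reusing the recipe and cross-validation folds from above (the grid ranges are arbitrary choices for illustration):

rf_model_tuned <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

rf_grid_2 <- grid_regular(mtry(range = c(3, 7)),
                          min_n(range = c(2, 40)),
                          levels = 3)   # 3 x 3 = 9 candidate combinations

rf_tuned_2 <- workflow() %>%
  add_recipe(whitewine_reciped) %>%
  add_model(rf_model_tuned) %>%
  tune_grid(resamples = white_cv,
            grid = rf_grid_2,
            metrics = metric_set(rmse, rsq, mae))

select_best(rf_tuned_2, metric = "rmse")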
Bear in mind that, in the first place, the outcome variable Y was multi-modal. This may be the reason why OLS wasn't a suitable learner here.
results_train %>%
ggplot(aes(x = truth, y = .pred)) +
geom_point() +
geom_smooth(method = "lm") +
geom_abline(col = "darkorange") +
labs(x = "actual values (truth)",
y = "predicted values",
title = "Training dataset") +
scale_x_continuous(breaks = c(1:10)) +
facet_wrap( model ~ ., ncol = 2) +
theme_few()
results_test %>%
ggplot(aes(x = truth, y = .pred)) +
geom_point() +
geom_smooth(method = "lm") +
geom_abline(col = "darkorange") +
labs(x = "actual values (truth)",
y = "predicted values",
title = "Testing dataset") +
scale_x_continuous(breaks = c(1:10)) +
facet_wrap( model ~ ., ncol = 2) +
theme_few()
It took me a while to piece the different parts of the jigsaw together and see the whole picture for machine learning. Initially, I was carrying out EDA blindly, simply using skim() because it is a convenient function, without fully understanding what I should be looking out for. I was doing pre-processing steps at random, depending on what I saw on other websites. Finally, I saw the light: the purpose of doing EDA is to work out what needs to be done during preprocessing!
It is always good to start with simple OLS when learning regression. There are assumptions that must be met before doing OLS; these can be checked using the gvlma package, and the necessary transformations carried out before fitting. There are other extensions of linear regression, for example the generalized linear model (GLM), which I should try as well.
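As an illustration of that assumption check, a sketch using gvlma on a plain lm fit of the same training data (gvlma reports a global test of the linear-model assumptions, plus individual tests for skewness, kurtosis, link function and heteroscedasticity):

library(gvlma)

# fit an ordinary lm on the raw training data and test the OLS assumptions
ols_fit <- lm(quality ~ ., data = whitewine_train)
gvlma(ols_fit)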
The order of carrying out preprocessing steps matters!
The choice of all_numeric() versus all_predictors() in the recipe steps matters! In this case, all_numeric() includes the Y variable. Although Y is multi-modal, it is not skewed, so I should not log-transform it - which is exactly what would happen if I used step_log(all_numeric()) (see the sketch after this paragraph). If I log-transformed Y inside the recipe, I would run into errors further along the script, as there are some bugs around the predict() step when Y is transformed. The OLS model performed relatively consistently on both the training and test datasets. However, the RF model performed better on the training dataset but worse on the test dataset, which suggests that the RF model, in this case, had over-fitting issues.
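To make that selector contrast concrete (a sketch; the second variant is the one used in this post):

# variant A: log-transforms every numeric column, including the outcome quality - not what I want here
recipe(quality ~ ., data = whitewine_train) %>%
  step_log(all_numeric(), offset = 1)

# variant B: log-transforms only the predictors, leaving quality untouched (pH excluded as it is not skewed)
recipe(quality ~ ., data = whitewine_train) %>%
  step_log(all_predictors(), -pH, offset = 1)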
“There is only one corner of the universe you can be certain of improving, and that’s your own self.” ― Aldous Huxley
This is just the beginning of my learning journey!